medicare <-read_csv(here("c17/data/medicare.csv"), show_col_types =FALSE) |>mutate(across(where(is_character), as_factor),subject =as.character(subject), insurance =fct_relevel(insurance, "no", "yes"),logvisits =log(visits +1)) ## needed because some have 0 visitsset.seed(432)med_split <-initial_split(medicare, prop =0.75)med_train =training(med_split)med_test =testing(med_split)
The medicare data
medicare
# A tibble: 4,406 × 9
subject visits hospital health chronic sex school insurance logvisits
<chr> <dbl> <dbl> <fct> <dbl> <fct> <dbl> <fct> <dbl>
1 1 5 1 average 2 male 6 yes 1.79
2 2 1 0 average 2 female 10 yes 0.693
3 3 13 3 poor 4 female 10 no 2.64
4 4 16 1 poor 2 male 3 yes 2.83
5 5 3 0 average 2 female 6 yes 1.39
6 6 17 0 poor 5 female 7 no 2.89
7 7 9 0 average 0 female 8 yes 2.30
8 8 3 0 average 0 female 8 yes 1.39
9 9 1 0 average 0 female 8 yes 0.693
10 10 0 0 average 0 female 8 yes 0
# ℹ 4,396 more rows
Reiterating the Goal
Predict visits using these 6 predictors…
Predictor
Description
hospital
# of hospital stays
health
self-rated health (poor, average, excellent)
chronic
# of chronic conditions
sex
male or female
school
years of education
insurance
subject (also) has private insurance? (yes/no)
First Four models (fit last class)
mod_1 <-glm(visits ~ hospital + health + chronic + sex + school + insurance,data = med_train, family ="poisson")mod_2 <-glm.nb(visits ~ hospital + health + chronic + sex + school + insurance,data = med_train)mod_3 <-zeroinfl(visits ~ hospital + health + chronic + sex + school + insurance,data = med_train)mod_4 <-zeroinfl(visits ~ hospital + health + chronic + sex + school + insurance,dist ="negbin", data = med_train)
First Four Models, augmented
mod_1_aug <-augment(mod_1, med_train, type.predict ="response")mod_2_aug <-augment(mod_2, med_train, type.predict ="response")mod_3_aug <- med_train |>mutate(".fitted"=predict(mod_3, type ="response"),".resid"=resid(mod_3, type ="response"))mod_4_aug <- med_train |>mutate(".fitted"=predict(mod_4, type ="response"),".resid"=resid(mod_4, type ="response"))
The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared. If the hurdle is not cleared, then we have a count of 0.
The first part of the model is typically a binary logistic regression model. This models whether an observation takes a positive count or not.
The second part of the model is usually a truncated Poisson or Negative Binomial model. Truncated means we’re only fitting positive counts, and not zeros.
mod_5: Poisson-Logistic Hurdle Model
Fitting a Hurdle Model / Poisson-Logistic
In fitting a hurdle model to our medicare training data, the interpretation would be that one process governs whether a patient visits a doctor or not, and another process governs how many visits are made.
The mod_5 model
mod_5 <-hurdle(visits ~ hospital + health + chronic + sex + school + insurance,dist ="poisson", zero.dist ="binomial", data = med_train)mod_5
Call:
hurdle(formula = visits ~ hospital + health + chronic + sex + school +
insurance, data = med_train, dist = "poisson", zero.dist = "binomial")
Count model coefficients (truncated poisson with log link):
(Intercept) hospital healthpoor healthexcellent
1.29892 0.16041 0.30243 -0.28116
chronic sexfemale school insuranceyes
0.09697 0.05611 0.02332 0.09351
Zero hurdle model coefficients (binomial with logit link):
(Intercept) hospital healthpoor healthexcellent
-0.2998 0.3044 0.1114 -0.3705
chronic sexfemale school insuranceyes
0.4970 0.3652 0.0637 0.6625
mod_5 summary
summary(mod_5)
Call:
hurdle(formula = visits ~ hospital + health + chronic + sex + school +
insurance, data = med_train, dist = "poisson", zero.dist = "binomial")
Pearson residuals:
Min 1Q Median 3Q Max
-5.4472 -1.1621 -0.4769 0.5698 24.3403
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.298920 0.029612 43.865 < 2e-16 ***
hospital 0.160410 0.006781 23.656 < 2e-16 ***
healthpoor 0.302432 0.020026 15.102 < 2e-16 ***
healthexcellent -0.281162 0.035755 -7.864 3.73e-15 ***
chronic 0.096971 0.005417 17.901 < 2e-16 ***
sexfemale 0.056109 0.014933 3.757 0.000172 ***
school 0.023321 0.002138 10.907 < 2e-16 ***
insuranceyes 0.093508 0.019789 4.725 2.30e-06 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.29978 0.16500 -1.817 0.069240 .
hospital 0.30445 0.10235 2.975 0.002934 **
healthpoor 0.11138 0.18993 0.586 0.557605
healthexcellent -0.37054 0.16200 -2.287 0.022179 *
chronic 0.49699 0.05112 9.722 < 2e-16 ***
sexfemale 0.36524 0.10116 3.610 0.000306 ***
school 0.06370 0.01382 4.611 4.01e-06 ***
insuranceyes 0.66251 0.11789 5.620 1.91e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 14
Log-likelihood: -1.222e+04 on 16 Df
Let’s fit a linear regression (mod_0: note log transformation) to go along with the Poisson regression (mod_1) we fit last time.
mod_0 <-lm(log(visits+1) ~ hospital + health + chronic + sex + school + insurance, data = med_train)mod_1 <-glm(visits ~ hospital + health + chronic + sex + school + insurance, data = med_train, family ="poisson")
Linear Model Coefficients?
## linear modeltidy(mod_0) |>gt() |>fmt_number(decimals =3) |>tab_options(table.font.size =20)
Selecting non-linear terms after Spearman \(\rho^2\)
Spearman \(\rho^2\) plot
plot(spearman2(visits ~ hospital + health + chronic + sex + school + insurance, data = med_train))
Reiterating the Goal
This is the order of the predictors (chronic highest) on the Spearman \(\rho^2\) plot from the previous slide.
Predictor
Description
chronic
# of chronic conditions (all values 0-8)
hospital
# of hospital stays (all values 0-8)
health
self-rated health (poor, average, excellent)
insurance
subject (also) has private insurance? (yes/no)
school
years of education
sex
male or female
What might we do?
chronic is a count (all values 0-8), then a gap to…
hospital also quantitative, also a count (0-8)
health is a 3-category factor
We might:
include a restricted cubic spline with 4-5 knots in chronic
include a rcs with fewer knots in hospital
include an interaction between health and chronic or perhaps health and hospital
Could we build an ols() fit?
Splines sometimes crash with discrete predictors (like counts.)
For these data, it turns out that even a 3-knot spline in hospital fails (if we already have the four-knot spline in chronic), but the ols() function will let us add both interactions we’re considering.
d <-datadist(medicare); options(datadist ="d")mod_toobig <-ols(log(visits +1) ~rcs(chronic, 4) + hospital * health + chronic %ia% health + sex + school + insurance, data = med_train)
Why is this model “too big”?
mod_toobig
Linear Regression Model
ols(formula = log(visits + 1) ~ rcs(chronic, 4) + hospital *
health + chronic %ia% health + sex + school + insurance,
data = med_train)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 3304 LR chi2 664.03 R2 0.182
sigma0.8363 d.f. 13 R2 adj 0.179
d.f. 3290 Pr(> chi2) 0.0000 g 0.444
Residuals
Min 1Q Median 3Q Max
-2.42109 -0.55490 0.08359 0.56662 3.07394
Coef S.E. t Pr(>|t|)
Intercept 0.5590 0.0575 9.73 <0.0001
chronic 0.3011 0.0546 5.52 <0.0001
chronic' -0.2051 0.2579 -0.80 0.4264
chronic'' 0.2159 0.6311 0.34 0.7323
hospital 0.2475 0.0249 9.95 <0.0001
health=poor 0.5914 0.0952 6.21 <0.0001
health=excellent -0.2022 0.0730 -2.77 0.0057
chronic * health=poor -0.0931 0.0335 -2.78 0.0054
chronic * health=excellent -0.0213 0.0565 -0.38 0.7058
sex=female 0.1088 0.0297 3.66 0.0003
school 0.0257 0.0041 6.20 <0.0001
insurance=yes 0.2353 0.0375 6.28 <0.0001
hospital * health=poor -0.2053 0.0421 -4.88 <0.0001
hospital * health=excellent 0.1623 0.1493 1.09 0.2769
Uh, oh.
plot(nomogram(mod_toobig, fun = exp, funlabel ="Visits + 1"))
A more reasonable option?
d <-datadist(medicare); options(datadist ="d")mod_new <-ols(log(visits +1) ~rcs(chronic, 4) + hospital + health + chronic %ia% health + sex + school + insurance, data = med_train)
What does this mod_new show?
mod_new
Linear Regression Model
ols(formula = log(visits + 1) ~ rcs(chronic, 4) + hospital +
health + chronic %ia% health + sex + school + insurance,
data = med_train)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 3304 LR chi2 637.75 R2 0.176
sigma0.8393 d.f. 11 R2 adj 0.173
d.f. 3292 Pr(> chi2) 0.0000 g 0.435
Residuals
Min 1Q Median 3Q Max
-3.06891 -0.54950 0.08035 0.56946 3.04632
Coef S.E. t Pr(>|t|)
Intercept 0.5743 0.0576 9.97 <0.0001
chronic 0.3036 0.0548 5.55 <0.0001
chronic' -0.1710 0.2588 -0.66 0.5089
chronic'' 0.1165 0.6331 0.18 0.8540
hospital 0.1799 0.0199 9.02 <0.0001
health=poor 0.5437 0.0951 5.72 <0.0001
health=excellent -0.1940 0.0724 -2.68 0.0074
chronic * health=poor -0.1199 0.0331 -3.62 0.0003
chronic * health=excellent -0.0163 0.0563 -0.29 0.7718
sex=female 0.1051 0.0298 3.53 0.0004
school 0.0256 0.0042 6.16 <0.0001
insurance=yes 0.2307 0.0376 6.14 <0.0001
How many df did we add here?
anova(mod_new)
Analysis of Variance Response: log(visits + 1)
Factor d.f. Partial SS MS
chronic (Factor+Higher Order Factors) 5 189.349521 37.8699042
All Interactions 2 9.251314 4.6256570
Nonlinear 2 9.483346 4.7416730
hospital 1 57.342322 57.3423218
health (Factor+Higher Order Factors) 4 39.675076 9.9187689
All Interactions 2 9.251314 4.6256570
chronic * health (Factor+Higher Order Factors) 2 9.251314 4.6256570
sex 1 8.763370 8.7633703
school 1 26.694549 26.6945488
insurance 1 26.545793 26.5457929
TOTAL NONLINEAR + INTERACTION 4 31.942728 7.9856821
REGRESSION 11 493.787139 44.8897399
ERROR 3292 2319.211257 0.7044992
F P
53.75 <.0001
6.57 0.0014
6.73 0.0012
81.39 <.0001
14.08 <.0001
6.57 0.0014
6.57 0.0014
12.44 0.0004
37.89 <.0001
37.68 <.0001
11.34 <.0001
63.72 <.0001
What does this ols() fit look like?
plot(summary(mod_new))
What does this ols() fit look like?
How’s the nomogram?
plot(nomogram(mod_new, fun = exp, funlabel ="Visits + 1"))
Can we fit a Poisson model with a function from rms?
The Glm() function in rms
d <-datadist(medicare); options(datadist ="d")mod_1_Glm <-Glm(visits ~ hospital + health + chronic + sex + school + insurance, data = med_train, family =poisson())
and we could have used rcs() or polynomials or interactions if we wanted to do so.